Abstract

In this paper, we introduce a new machine learning theory based on multi-channel parallel
adaptation for rule discovery. This theory is distinguished from the familiar parallel-distributed
adaptation theory of neural networks in terms of channel-based convergence to
the target rules. We show how to realize this theory in a learning system named CFRule. 
CFRule is a parallel weight-based model, but it departs from traditional neural computing 
in that its internal knowledge is comprehensible. Furthermore, when the model converges 
upon training, each channel converges to a target rule. The model adaptation rule is de- 
rived by multi-level parallel weight optimization based on gradient descent. Since, however, 
gradient descent only guarantees local optimization, a multi-channel regression-based op- 
timization strategy is developed to effectively deal with this problem. Formally, we prove 
that the CFRule model can explicitly and precisely encode any given rule set. Also, we
prove a property related to asynchronous parallel convergence, which is a critical element 
of the multi-channel parallel adaptation theory for rule learning. Thanks to the quanti- 
zability nature of the CFRule model, rules can be extracted completely and soundly via 
a threshold-based mechanism. Finally, the practical application of the theory is demon- 
strated in DNA promoter recognition and hepatitis prognosis prediction. 

Keywords: rule discovery, adaptation, optimization, regression, certainty factor, neural net- 
work, machine learning, uncertainty management, artificial intelligence. 

1 Introduction 

Rules express general knowledge about actions or conclusions in given circumstances and 
also principles in given domains. In the if-then format, rules are an easy way to represent 
cognitive processes in psychology and a useful means to encode expert knowledge. In another 
perspective, rules are important because they can help scientists understand problems and 
engineers solve problems. These observations would account for the fact that rule learning or 
discovery has become a major topic in both machine learning and data mining research. The 
former discipline concerns the construction of computer programs which learn knowledge or 
skill while the latter is about the discovery of patterns or rules hidden in the data. 

The fundamental concepts of rule learning are discussed in [16]. Methods for learning sets of 
rules include symbolic heuristic search [3, 5], decision trees [17-18], inductive logic programming 
[13], neural networks [2, 7, 20], and genetic algorithms [10]. A methodology comparison can 
be found in our previous work [9]. Despite the differences in their computational frameworks, 
these methods perform a certain kind of search in the rule space (i.e., the space of possible 
rules) in conjunction with some optimization criterion. Complete search is difficult unless
the domain is small, because exhaustive search has exponential computational complexity.
Each of the approaches described therefore has limitations that restrict its effectiveness.
In particular, we should point out that all the
algorithms except exhaustive search guarantee only local but not global optimization. For
example, a sequential covering algorithm such as CN2 [5] performs a greedy search for a single 
rule at each sequential stage without backtracking and could make a suboptimal choice at 
any stage; a simultaneous covering algorithm such as ID3 [18] learns the entire set of rules 
simultaneously but it searches incompletely through the hypothesis space because of attribute 
ordering; a neural network algorithm which adopts gradient-descent search is prone to local 
minima. 

In this paper, we introduce a new machine learning theory based on multi-channel parallel 
adaptation that shows great promise in learning the target rules from data by parallel global 
convergence. This theory is distinct from the familiar parallel-distributed adaptation theory 
of neural networks in terms of channel-based convergence to the target rules. We describe 
a system named CFRule which implements this theory. CFRule bases its computational 
characteristics on the certainty factor (CF) model [4, 22] it adopts. The CF model is a calculus
of uncertainty management and has been used to approximate standard probability theory [1]
in artificial intelligence. It has been found that certainty factors associated with rules can be 
revised by a neural network [6, 12, 15]. Our research has further indicated that the CF model 
used as the neuron activation function (for combining inputs) can improve the neural-network 
performance [8]. 

The rest of the paper is organized as follows. Section 2 describes the multi-channel rule
learning model. Section 3 examines the formal properties of rule encoding. Section 4 derives
the model parameter adaptation rule, presents a novel optimization strategy to deal with the
local minimum problem due to gradient descent, and proves a property related to asynchronous
parallel convergence, which is a critical element of the main theory. Section 5 formulates a rule
extraction algorithm. Section 6 demonstrates practical applications. Then we draw conclusions
in the final section.



2 The Multi-Channel Rule Learning Model

CFRule is a rule-learning system based on multi-level parameter optimization. The kernel of
CFRule is a multi-channel rule learning model. CFRule can be embodied as an artificial neural
network, but the neural network structure is not essential. We start with formal definitions
of the model.

Definition 2.1 The multi-channel rule learning model M is defined by k ($k \ge 1$) channels
(Ch's), an input vector ($M_{in}$), and an output ($M_{out}$) as follows:

$$M = (Ch_1, Ch_2, \ldots, Ch_k, M_{in}, M_{out}) \tag{1}$$

where $-1 \le M_{out} \le 1$ and

$$M_{in} = (x_1, x_2, \ldots, x_d) \tag{2}$$

such that d is the input dimensionality and $-1 \le x_i \le 1$ for all i.
The model has only a single output because here we assume the problem is a single-class,
multi-rule learning problem. The framework can be easily extended to the multi-class case. 

Definition 2.2 Each channel ($Ch_j$) is defined by an output weight ($u_j$), a set of input weights
($w_{ji}$'s), activation ($\phi_j$), and influence ($\psi_j$) as follows:

$$Ch_j = (u_j, w_{j0}, w_{j1}, w_{j2}, \ldots, w_{jd}, \phi_j, \psi_j) \tag{3}$$

where $w_{j0}$ is the bias, $0 \le u_j \le 1$, and $-1 \le w_{ji} \le 1$ for all i. The input weight vector
$(w_{j1}, \ldots, w_{jd})$ defines the channel's pattern.

Definition 2.3 Each channel's activation is defined by

$$\phi_j = f_{cf}(w_{j0}, w_{j1}x_1, w_{j2}x_2, \ldots, w_{jd}x_d) \tag{4}$$

where $f_{cf}$ is the CF-combining function [4, 22], as defined below.

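Definition 2.4 The CF-combining function is given by

$$f_{cf}(x_1, x_2, \ldots, y_1, y_2, \ldots) = f_{cf}^{+}(x_1, x_2, \ldots) + f_{cf}^{-}(y_1, y_2, \ldots) \tag{5}$$

where

$$f_{cf}^{+}(x_1, x_2, \ldots) = 1 - \prod_i (1 - x_i) \tag{6}$$

$$f_{cf}^{-}(y_1, y_2, \ldots) = -1 + \prod_j (1 + y_j) \tag{7}$$

$x_i$'s are nonnegative numbers and $y_j$'s are negative numbers.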

As we will see, the CF-combining function contributes to several important computational 
properties instrumental to rule discovery. 

Definition 2.5 Each channel's influence on the output is defined by

$$\psi_j = u_j \phi_j \tag{8}$$

Definition 2.6 The model output $M_{out}$ is defined by

$$M_{out} = f_{cf}(\psi_1, \psi_2, \ldots, \psi_k) \tag{9}$$
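Definitions 2.3 to 2.6 together specify how the model output is computed from an input vector. The following is a minimal Python sketch of that computation, not the authors' implementation. The function names (`cf_combine`, `channel_activation`, `model_output`) and the representation of a channel as a pair `(u, w)` with `w[0]` as the bias are assumptions made for illustration.

```python
from math import prod

def cf_combine(args):
    """CF-combining function of Definition 2.4 (Eqs. 5-7)."""
    pos = [a for a in args if a >= 0]          # nonnegative arguments
    neg = [a for a in args if a < 0]           # negative arguments
    f_pos = 1.0 - prod(1.0 - a for a in pos)   # Eq. (6)
    f_neg = -1.0 + prod(1.0 + a for a in neg)  # Eq. (7)
    return f_pos + f_neg                       # Eq. (5)

def channel_activation(w, x):
    """phi_j of Eq. (4); w = [w_j0, w_j1, ..., w_jd], x = [x_1, ..., x_d]."""
    return cf_combine([w[0]] + [wi * xi for wi, xi in zip(w[1:], x)])

def model_output(channels, x):
    """M_out of Eq. (9); channels is a list of (u_j, w_j) pairs (assumed layout)."""
    influences = [u * channel_activation(w, x) for u, w in channels]  # Eq. (8)
    return cf_combine(influences)
```

Note that an empty product is 1, so a call with no negative arguments contributes 0 from Eq. (7), and likewise for Eq. (6).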






We call the class whose rules are to be learned the target class, and define rules inferring (or
explaining) that class to be the target rules. For instance, if the disease diabetes is the target 
class, then the diagnostic rules for diabetes would be the target rules. Each target rule defines 
a condition under which the given class can be inferred. Note that we do not consider rules 
which deny the target class, though such rules can be defined by reversing the class concept. 
The task of rule learning is to learn or discover a set of target rules from given instances called 
training instances (data). It is important that rules learned should be generally applicable 
to the entire domain, not just the training data. How well the target rules learned from the 
training data can be applied to unseen data determines the generalization performance. 

Instances which belong to the target class are called positive instances; all others are called negative
instances. Ideally, a positive training instance should match at least one target rule learned and
vice versa, whereas a negative training instance should match none. So, if there is only a single 
target rule learned, then it must be matched by all (or most) positive training instances. But if 
multiple target rules are learned, then each rule is matched by some (rather than all) positive 
training instances. Since the number of possible rule sets is far greater than the number of 
possible rules, the problem of learning multiple rules is naturally much more complex than 
that of learning single rules. 

In the multi-channel rule learning theory, the model learns to sort out instances so that 
instances belonging to different rules flow through different channels, and at the same time, 
channels are adapted to accommodate their pertinent instances and learn corresponding rules. 
Notice that this is a mutual process and it cannot occur all at once. In the beginning, the
rules are not learned and the channels are not properly shaped, so both information flow and
adaptation are more or less random. Through self-adaptation, however, the CFRule model
gradually converges to the correct rules, each encoded by a channel. The essence of this paper
is to prove this property.

In the model design, a legitimate question is what the optimal number of channels is. This 
is just like the question raised for a neural network of how many hidden (internal computing) 
units should be used. It is true that too many hidden units cause data overfitting and make 
generalization worse [7]. Thus, a general principle is to use a minimal number of hidden units. 
The same principle can be equally well applied to the CFRule model. However, there is a 
difference. In ordinary neural networks, the number of hidden units is determined by the 
sample size, while in the CFRule model, the number of channels should match the number of 
rules embedded in the data. Since, however, we do not know how many rules are present in 
the data, our strategy is to use a minimal number of channels that admits convergence on the 
training data. 

The model's behavior is characterized by three aspects: 

• Information processing: Compute the model output for a given input vector. 

• Learning or training: Adjust channels' parameters (output and input weights) so that
the input vector is mapped into the output for every instance in the training data.

• Rule extraction: Extract rules from a trained model.

The first aspect has been described already.

3 Model Representation of Rules 

The IF-THEN rule (i.e., If the premise, then the action) is a major knowledge representation 
paradigm in artificial intelligence. Here we analyze how such rules can be represented
with proper semantics in the CFRule model. 

Definition 3.1 CFRule learns rules in the form of

IF $A_1^{+}, \ldots, A_i^{+}, \ldots, \neg A_1^{-}, \ldots, \neg A_j^{-}, \ldots,$ THEN the target class with a certainty factor.

where $A_i^{+}$ is a positive antecedent (in the positive form), $A_j^{-}$ a negated antecedent (in the
negative form), and $\neg$ reads "not." Each antecedent can be a discrete or discretized attribute
(feature), variable, or a logic proposition. The IF part must not be empty. The attached
certainty factor in the THEN part, called the rule CF, is a positive real $\le 1$.

The rule's premise is restricted to a conjunction, and no disjunction is allowed. The collection of 
rules for a certain class can be formulated as a DNF (disjunctive normal form) logic expression, 
namely, the disjunction of conjunctions, which implies the class. However, rules defined here 
are not traditional logic rules because of the attached rule CFs meant to capture uncertainty. 
We interpret a rule by saying when its premise holds (that is, all positive antecedents mentioned 
are true and all negated antecedents mentioned are false) , the target concept holds at the given 
confidence level. CFRule can also learn rules with weighted antecedents (a kind of fuzzy rules), 
but we will not consider this case here. 

There is increasing evidence to indicate that good rule encoding capability actually fa- 
cilitates rule discovery in the data. In the theorems that follow, we show how the CFRule 
model can explicitly and precisely encode any given rule set. We note that the ordinary
sigmoid-function neural network can only do this implicitly and approximately. We also
note that although the threshold function of the perceptron model enables it to learn conjunctions
or disjunctions, the non-differentiability of this function prohibits the use of an adaptive
procedure in a multilayer construct.



Theorem 3.1 For any rule represented by Definition 3.1, there exists a channel in the CFRule
model to encode the rule so that if an instance matches the rule, the channel's activation is 1,
else 0.

(Proof): This can be proven by construction. Suppose we implement channel j by setting the
bias weight to 1, the input weights associated with all positive attributes in the rule's premise
to 1, the input weights associated with all negated attributes in the rule's premise to -1, the
rest of the input weights to 0, and finally the output weight to the rule CF. Assume that each
instance is encoded by a bipolar vector in which for each attribute, 1 means true and -1 false.
When an instance matches the rule, the following conditions hold: $x_i = 1$ if $x_i$ is part of the
rule's premise, $x_i = -1$ if $\neg x_i$ is part of the rule's premise, and otherwise $x_i$ can be of any
value. For such an instance, given the above construction, it is true that $w_{ji}x_i = 1$ or 0 for all
i. Thus, the channel's activation (by Definition 2.3),

$$\phi_j = f_{cf}(w_{j0} = 1, w_{j1}x_1, w_{j2}x_2, \ldots, w_{jd}x_d) \tag{10}$$

must be 1 according to $f_{cf}$. On the other hand, if an instance does not match the rule, then
there exists i such that $w_{ji}x_i = -1$. Since $w_{j0}$ (the bias weight) = 1, the channel's activation
is 0 due to $f_{cf}$. □

Theorem 3.2 Assume that rule CFs $\ge \theta$ ($0 < \theta \le 1$). For any set of rules represented
by Definition 3.1, there exists a CFRule model to encode the rule set so that if an instance
matches any of the given rules, the model output is $\ge \theta$, else 0.

(Proof): Suppose there are k rules in the set. As suggested in the proof of Theorem 3.1, we
construct k channels, each encoding a different rule in the given rule set so that if an instance
matches, say rule j, then the activation of channel j is 1. In this case, since the channel's
influence $\psi_j$ is given by $u_j\phi_j$ (where $u_j$ is set to the rule CF) and the rule CF $\ge \theta$, it follows
that $\psi_j \ge \theta$. It is then clear that the model output must be $\ge \theta$ since it combines influences
from all channels, all of which are $\ge 0$ and at least one of which is $\ge \theta$. On the other hand, if an
instance fails to match any of the rules, all the channels' activations are zero, and so is the model output. □
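The constructions in the proofs of Theorems 3.1 and 3.2 can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: the helper `encode_rule`, the two example rules over four attributes, and their CFs (0.9 and 0.8) are all hypothetical.

```python
from math import prod

def cf_combine(args):
    """CF-combining function of Definition 2.4."""
    pos = [a for a in args if a >= 0]
    neg = [a for a in args if a < 0]
    return (1.0 - prod(1.0 - a for a in pos)) + (-1.0 + prod(1.0 + a for a in neg))

def encode_rule(d, positives, negatives, cf):
    """Build one channel (u, w) as in the proof of Theorem 3.1.
    positives/negatives hold 1-based attribute indices (hypothetical layout)."""
    w = [0.0] * (d + 1)
    w[0] = 1.0                      # bias weight set to 1
    for i in positives:
        w[i] = 1.0                  # positive antecedent
    for i in negatives:
        w[i] = -1.0                 # negated antecedent
    return cf, w                    # output weight set to the rule CF

def activation(w, x):
    return cf_combine([w[0]] + [wi * xi for wi, xi in zip(w[1:], x)])

def output(channels, x):
    return cf_combine([u * activation(w, x) for u, w in channels])

# Two hypothetical rules: IF x1 and not x2 (CF 0.9); IF x3 and x4 (CF 0.8).
channels = [encode_rule(4, [1], [2], 0.9), encode_rule(4, [3, 4], [], 0.8)]

matches_rule_1 = [1, -1, -1, -1]
matches_both = [1, -1, 1, 1]
matches_none = [-1, -1, -1, -1]
```

With $\theta = 0.8$ in this example, the output is at least 0.8 for the first two instances and exactly 0 for the third, which is the behavior stated in Theorem 3.2.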



4 Model Adaptation and Convergence 

In neural computing, the backpropagation algorithm [19] can be viewed as a multilayer, par- 
allel optimization strategy that enables the network to converge to a local optimum solution. 
The black-box nature of the neural network solution is reflected by the fact that the pattern 
(the input weight vector) learned by each neuron does not bear meaningful knowledge. The 
CFRule model departs from traditional neural computing in that its internal knowledge is 
comprehensible. Furthermore, when the model converges upon training, each channel converges
to a target rule. How to achieve this objective, and what mathematical theory underlies it,
are the main issues to be addressed.

4.1 Model Training Based on Gradient Descent 

The CFRule model learns to map a set of input vectors (e.g., extracted features) into a set
of outputs (e.g., class information) by training. An input vector along with its target output
constitutes a training instance. The input vector is encoded as a 1/-1 bipolar vector. The
target output is 1 for a positive instance and 0 for a negative instance.

Starting with a random or estimated weight setting, the model is trained to adapt itself 
to the characteristics of the training instances by changing weights (both output and input 
weights) for every channel in the model. Typically, instances are presented to the model one 
at a time. When all instances are examined (called an epoch), the network will start over with 
the first instance and repeat. Iterations continue until the system performance has reached a 
satisfactory level. 

The learning rule of the CFRule model is derived in the same way as the backpropagation 
algorithm [19]. The training objective is to minimize the sum of squared errors in the data. 
In each learning cycle, a training instance is given and the weights of channel j (for all j) are 
updated by 

$$u_j(t+1) = u_j(t) + \Delta u_j \tag{11}$$

$$w_{ji}(t+1) = w_{ji}(t) + \Delta w_{ji} \tag{12}$$

where $u_j$ is the output weight, $w_{ji}$ an input weight, the argument t denotes iteration t, and $\Delta$
the adjustment. The weight adjustment on the current instance is based on gradient descent.
Consider channel j. For the output weight ($u_j$),

$$\Delta u_j = -\eta (\partial E / \partial u_j) \tag{13}$$

($\eta$: the learning rate) where

$$E = \frac{1}{2}(T_{out} - M_{out})^2$$

($T_{out}$: the target output, $M_{out}$: the model output). Let

$$D = T_{out} - M_{out}$$

The partial derivative in Eq. (13) can be rewritten with the calculus chain rule to yield

$$\partial E / \partial u_j = (\partial E / \partial M_{out})(\partial M_{out} / \partial u_j) = -D (\partial M_{out} / \partial u_j)$$

Then we apply this result to Eq. (13) and obtain the following definition.

Definition 4.1 The learning rule for output weight $u_j$ of channel j is given by

$$\Delta u_j = \eta D (\partial M_{out} / \partial u_j) \tag{14}$$

For the input weights ($w_{ji}$'s), again based on gradient descent,

$$\Delta w_{ji} = -\eta (\partial E / \partial w_{ji}) \tag{15}$$

The partial derivative in Eq. (15) is equivalent to

$$\partial E / \partial w_{ji} = (\partial E / \partial \phi_j)(\partial \phi_j / \partial w_{ji})$$

Since $\phi_j$ is not directly related to E, the first partial derivative on the right hand side of the
above equation is expanded by the chain rule again to obtain

$$\partial E / \partial \phi_j = (\partial E / \partial M_{out})(\partial M_{out} / \partial \phi_j) = -D (\partial M_{out} / \partial \phi_j)$$

Substituting these results into Eq. (15) leads to the following definition.

Definition 4.2 The learning rule for input weight $w_{ji}$ of channel j is given by

$$\Delta w_{ji} = \eta d_j (\partial \phi_j / \partial w_{ji}) \tag{16}$$

where

$$d_j = D (\partial M_{out} / \partial \phi_j)$$
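
Assume that

$$\phi_j = f_{cf}^{+}(w_{j1}x_1, w_{j2}x_2, \ldots, w_{jd'}x_{d'}) + f_{cf}^{-}(w_{jd'+1}x_{d'+1}, \ldots, w_{jd}x_d) \tag{17}$$

Suppose $d' > 1$ and $d - d' > 1$. The partial derivative $\partial \phi_j / \partial w_{ji}$ can be computed as follows.

Case (a) If $w_{ji}x_i \ge 0$,

$$\partial \phi_j / \partial w_{ji} = \Big( \prod_{l \ne i,\; w_{jl}x_l \ge 0} (1 - w_{jl}x_l) \Big) x_i \tag{18}$$

Case (b) If $w_{ji}x_i < 0$,

$$\partial \phi_j / \partial w_{ji} = \Big( \prod_{l \ne i,\; w_{jl}x_l < 0} (1 + w_{jl}x_l) \Big) x_i \tag{19}$$

It is easy to show that if $d' = 1$ in case (a) or $d - d' = 1$ in case (b), $\partial \phi_j / \partial w_{ji} = x_i$.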
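
The learning rules of Definitions 4.1 and 4.2 can be sketched as a single function that returns the weight adjustments for one training instance. This is a hedged illustration, not the authors' code. The names `cf_partial` and `weight_deltas` are hypothetical, and the expressions used for $\partial M_{out}/\partial u_j$ and $\partial M_{out}/\partial \phi_j$ are an assumption obtained by applying the same product form as Eqs. (18) and (19) to Definition 2.6 together with $\psi_j = u_j\phi_j$.

```python
from math import prod

def cf_combine(args):
    """CF-combining function of Definition 2.4."""
    pos = [a for a in args if a >= 0]
    neg = [a for a in args if a < 0]
    return (1.0 - prod(1.0 - a for a in pos)) + (-1.0 + prod(1.0 + a for a in neg))

def cf_partial(args, i):
    """Partial derivative of f_cf with respect to args[i]:
    the product over the other arguments in the same sign class."""
    if args[i] >= 0:
        return prod(1.0 - a for l, a in enumerate(args) if l != i and a >= 0)
    return prod(1.0 + a for l, a in enumerate(args) if l != i and a < 0)

def weight_deltas(channels, x, target, eta):
    """channels: list of (u_j, w_j), w_j = [w_j0, ..., w_jd]; x = [x_1, ..., x_d].
    Returns a list of (delta_u_j, [delta_w_j0, ..., delta_w_jd])."""
    xs = [1.0] + list(x)                          # x_0 = 1 feeds the bias weight
    zs = [[wi * xi for wi, xi in zip(w, xs)] for _, w in channels]
    phis = [cf_combine(z) for z in zs]            # channel activations
    psis = [u * phi for (u, _), phi in zip(channels, phis)]
    D = target - cf_combine(psis)                 # D = T_out - M_out
    deltas = []
    for j, (u, w) in enumerate(channels):
        dM_dpsi = cf_partial(psis, j)             # assumed form, see lead-in
        delta_u = eta * D * dM_dpsi * phis[j]     # Eq. (14)
        d_j = D * dM_dpsi * u                     # d_j = D * dM_out/dphi_j
        delta_w = [eta * d_j * cf_partial(zs[j], i) * xs[i]
                   for i in range(len(w))]        # Eq. (16) with Eqs. (18)/(19)
        deltas.append((delta_u, delta_w))
    return deltas
```

The sketch only computes the adjustments. How the updated weights are kept within the ranges of Definition 2.2 is not stated in this section, so no clipping is applied here.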
4.2 Multi-Channel Regression-Based Optimization 

It is known that gradient descent can only find a local minimum. When the error surface is flat
or very convoluted, such an algorithm often ends up with a bad local minimum. Moreover, the 
learning performance is measured by the error on unseen data independent of the training set. 
Such error is referred to as generalization error. We note that minimization of the training 
error by the backpropagation algorithm does not guarantee simultaneous minimization of 
generalization error. What is worse, generalization error may instead rise after some point 
along the training curve due to an undesired phenomenon known as overfitting [7]. Thus, 
global optimization techniques for network training (e.g., [21]) do not necessarily offer help as 
far as generalization is concerned. To address this issue, CFRule uses a novel optimization 
strategy called multi-channel regression-based optimization (MCRO). 
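
In Definition 2.4, $f_{cf}^{+}$ and $f_{cf}^{-}$ can also be expressed as

$$f_{cf}^{+}(x_1, x_2, \ldots) = \sum_i x_i - \sum_{i<j} x_i x_j + \sum_{i<j<k} x_i x_j x_k - \cdots \tag{20}$$

$$f_{cf}^{-}(y_1, y_2, \ldots) = \sum_i y_i + \sum_{i<j} y_i y_j + \sum_{i<j<k} y_i y_j y_k + \cdots \tag{21}$$

When the arguments ($x_i$'s and $y_j$'s) are small, the CF function behaves somewhat like a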
linear function. It can be seen that if the magnitude of every argument is < 0.1, the first 
order approximation of the CF function is within an error of 10% or so. Since when learning 
starts, all the weights take on small values, this analysis has motivated the MCRO strategy 
for improving the gradient descent solution. The basic idea behind MCRO is to choose a 
starting point based on the linear regression analysis, in contrast to gradient descent which 
uses a random starting point. 
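
The quality of the first-order approximation can be checked directly. The short sketch below compares the CF-combining function with the plain sum of its arguments; the helper name `relative_error` is hypothetical, and the inputs are illustrative values rather than data from the paper.

```python
from math import prod

def cf_combine(args):
    """CF-combining function of Definition 2.4."""
    pos = [a for a in args if a >= 0]
    neg = [a for a in args if a < 0]
    return (1.0 - prod(1.0 - a for a in pos)) + (-1.0 + prod(1.0 + a for a in neg))

def relative_error(args):
    """Relative error of the linear (first-order) approximation sum(args).
    Undefined when the exact value is 0, e.g. for canceling arguments."""
    exact = cf_combine(args)
    linear = sum(args)
    return abs(linear - exact) / abs(exact)

err_two = relative_error([0.1, 0.1])          # two arguments of magnitude 0.1
err_three = relative_error([0.1, 0.1, 0.1])   # three arguments of magnitude 0.1
```

For two arguments of 0.1 the relative error is roughly 5%, and for three it is roughly 11%, so the error grows with the number of same-sign arguments.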

If we can use regression analysis to estimate the initial influence of each input variable on 
the model output, how can we know how to distribute this estimate over multiple channels? 
In fact, this is the most intricate part of the whole idea since each channel's structure and 
parameters are yet to be learned. The answer will soon be clear. 

In CFRule, each channel's activation is defined by

$$\phi_j = f_{cf}(w_{j0}, w_{j1}x_1, w_{j2}x_2, \ldots) \tag{22}$$

Suppose we separate the linear component from the nonlinear component (R) in $\phi_j$ to obtain

$$\phi_j = \Big( \sum_{i=0}^{d} w_{ji}x_i \Big) + R_j \tag{23}$$

We apply the same treatment to the model output (Definition 2.6)

$$M_{out} = f_{cf}(u_1\phi_1, u_2\phi_2, \ldots) \tag{24}$$

so that

$$M_{out} = \Big( \sum_{j=1}^{k} u_j\phi_j \Big) + R_{out} \tag{25}$$

Then we substitute Eq. (23) into Eq. (25) to obtain

$$M_{out} = \Big( \sum_{j=1}^{k} \sum_{i=0}^{d} u_j w_{ji} x_i \Big) + R_{acc} \tag{26}$$

in which the right hand side is equivalent to

$$\Big[ \sum_{i=0}^{d} \Big( \sum_{j=1}^{k} u_j w_{ji} \Big) x_i \Big] + R_{acc}$$

Note that

$$R_{acc} = \Big( \sum_{j=1}^{k} u_j R_j \Big) + R_{out}$$

Suppose linear regression analysis produces the following estimation equation for the model
output:

$$M'_{out} = b_0 + b_1 x_1 + \cdots$$

(all the input variables and the output transformed to the range from 0 to 1).

Table 1: The target rules in the simulation experiment.

rule 1:  IF $x_1$ and $\neg x_2$ and $x_7$   THEN the target concept
rule 2:  IF $x_1$ and $\neg x_4$ and $x_5$   THEN the target concept
rule 3:  IF $x_6$ and $x_{11}$               THEN the target concept

Table 2: Comparison of the MCRO strategy with random start for the convergence to the
target rules. The results were validated by the statistical t test with the level of significance
< 0.01 and < 0.025 (degrees of freedom = 48) for the training and test error rates upon
convergence, respectively.

                         MCRO     Random Start   t-Value   Level of Significance
Train error rate mean    0.010    0.026          2.47      0.01
Test error rate mean     0.012    0.033          2.34      0.025

Definition 4.3 The MCRO strategy is defined by

$$\sum_{j=1}^{k} u_j(t = 0)\, w_{ji}(t = 0) = b_i \tag{27}$$

for each i, $0 \le i \le d$.

That is, at iteration t = 0 when learning starts, the initial weights are randomized but subject
to these d + 1 constraints.
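
One way to produce randomized initial weights that satisfy the constraints of Eq. (27) is sketched below. This is an assumption for illustration only: the paper does not say how the constraints are enforced. Here the weights are drawn at random and each column of input weights is then shifted by the smallest correction (in the least-squares sense) that meets its constraint. The function name `mcro_init`, the sampling ranges, and the coefficients `b` passed in are all hypothetical; `b` would come from a separate linear regression.

```python
import random

def mcro_init(b, k, seed=None):
    """Return (u, w) with sum_j u[j] * w[j][i] == b[i] for every i (Eq. 27).
    b = [b_0, ..., b_d] are regression coefficients computed elsewhere."""
    rng = random.Random(seed)
    n = len(b)
    # Assumed sampling ranges; the source only says the weights are randomized.
    u = [rng.uniform(0.5, 1.0) for _ in range(k)]
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(k)]
    uu = sum(uj * uj for uj in u)
    for i in range(n):
        gap = b[i] - sum(u[j] * w[j][i] for j in range(k))
        for j in range(k):
            w[j][i] += u[j] * gap / uu   # minimum-norm shift onto the constraint
    return u, w

b = [0.05, 0.2, -0.1]                    # illustrative coefficients, not from the paper
u, w = mcro_init(b, k=3, seed=0)
```

The shift does not check the weight ranges of Definition 2.2, so a caller would need to verify that the corrected input weights still lie in [-1, 1] when the coefficients are large.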

To demonstrate this strategy, we designed an experiment. Assume there were 20 input
variables and three target rules as shown in Table 1. The training and test data sets were
generated independently, each consisting of 100 random instances. An instance was classified as 
positive if it matched any of the target rules and as negative otherwise. The CFRule model for 
this experiment comprised three channels. The model was trained under MCRO and random 
start separately. For each strategy, 25 trials were run, each with a different initial weight 
setting. The same learning rate and stopping condition were used in every trial regardless of 
the strategy taken. The training and test error rates were measured. If the model converged to 
the target rules, then both training and test errors should be close to zero. We used the t test 
(one-sided hypothesis testing based on the statistical t distribution) to evaluate the difference 
in the means of error rates produced under the two strategies. Given the statistical validation
result (as summarized in Table 2), we can conclude that MCRO yields significantly lower
training and test error rates than a random start in this experiment.

4.3 Asynchronous Parallel Convergence

In the multi-channel rule learning theory, there are two possible modes of parallel convergence. 
In the synchronous mode, all channels converge to their respective target patterns at the same 
time, whereas in the asynchronous mode, each channel converges at a different time. In a 
self-adaptation or self-organization model without a global clock, the synchronous mode is not 
a plausible scenario of convergence. On the other hand, the asynchronous mode may not arrive 
at global convergence (i.e., every channel converging to its target pattern) unless there is a 
mechanism to protect a target pattern once it is converged upon. Here we examine a formal 
property of CFRule on this new learning issue. 

Theorem 4.1 Suppose at time t, channel j of the CFRule model has learned an exact pattern
$(w_{j1}, w_{j2}, \ldots, w_{jd})$ ($d \ge 1$) such that $w_{j0}$ (the bias) = 1 and $w_{ji}$ = 1 or -1 or 0 for $1 \le i \le d$. At
time t+1 when the model is trained on a given instance with the input vector $(x_0, x_1, x_2, \ldots, x_d)$
($x_0 = 1$ and $x_i$ = 1 or -1 for all $1 \le i \le d$), the pattern is unchanged unless there is a single
mismatched weight (weight $w_{ji}$ is mismatched if and only if $w_{ji}x_i = -1$). Let $\Delta w_{ji}(t+1)$ be
the weight adjustment for $w_{ji}$. Then

(a) If there is no mismatch, then $\Delta w_{ji}(t+1) = 0$ for all i.

(b) If there is more than one mismatched weight, then $\Delta w_{ji}(t+1) = 0$ for all i.

(Proof): In case (a), there is no mismatch, so $w_{ji}x_i$ = 1 or 0 for all i. There exists l such that
$w_{jl}x_l = 1$ and $l \ne i$, for example, $w_{j0}x_0 = 1$ as given. From Eq. (18),

$$\partial \phi_j / \partial w_{ji} = \Big( \prod_{l \ne i,\; w_{jl}x_l \ge 0} (1 - w_{jl}x_l) \Big) x_i = 0$$

Then from Eq. (16),

$$\Delta w_{ji}(t+1) = \eta d_j (\partial \phi_j / \partial w_{ji}) = 0$$

In case (b), the proof for matched weights is the same as that in case (a). Consider only
mismatched weights $w_{ji}$'s such that $w_{ji}x_i = -1$. Since there are at least two mismatched
weights, there exists l such that $w_{jl}x_l = -1$ and $l \ne i$. From Eq. (19),

$$\partial \phi_j / \partial w_{ji} = \Big( \prod_{l \ne i,\; w_{jl}x_l < 0} (1 + w_{jl}x_l) \Big) x_i = 0$$

Therefore,

$$\Delta w_{ji}(t+1) = \eta d_j (\partial \phi_j / \partial w_{ji}) = 0$$

In the case of a single mismatched weight,

$$\partial \phi_j / \partial w_{ji} = x_i$$

which is not zero, so the weight adjustment $\Delta w_{ji}(t+1)$ may or may not be zero, depending
on the error $d_j$. □
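
The three situations covered by Theorem 4.1 can be verified numerically from Eqs. (18) and (19). The sketch below is illustrative only: the function name `dphi_dw`, the example pattern, and the three instances are hypothetical. The check is restricted to the pattern weights ($1 \le i \le d$), as in the theorem, and leaves out the bias weight.

```python
from math import prod

def dphi_dw(w, x, i):
    """Partial derivative of phi_j with respect to w_ji, Eqs. (18) and (19).
    w and x both include index 0 (bias weight and x_0 = 1)."""
    z = [wl * xl for wl, xl in zip(w, x)]
    if z[i] >= 0:   # case (a)
        others = [1.0 - z[l] for l in range(len(z)) if l != i and z[l] >= 0]
    else:           # case (b)
        others = [1.0 + z[l] for l in range(len(z)) if l != i and z[l] < 0]
    return prod(others) * x[i]

# Hypothetical exact pattern: bias 1, then "x1 and not x2", x3 unused.
w = [1.0, 1.0, -1.0, 0.0]

no_mismatch = [1.0, 1.0, -1.0, 1.0]
two_mismatches = [1.0, -1.0, 1.0, 1.0]
one_mismatch = [1.0, -1.0, -1.0, 1.0]      # only w_j1 is mismatched

grads_none = [dphi_dw(w, no_mismatch, i) for i in range(1, 4)]
grads_two = [dphi_dw(w, two_mismatches, i) for i in range(1, 4)]
grads_one = [dphi_dw(w, one_mismatch, i) for i in range(1, 4)]
```

In the first two cases every derivative vanishes, so the pattern cannot move whatever the error is. In the single-mismatch case only the mismatched weight has a nonzero derivative, equal to $x_i$.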

Since model training starts with small weight values, the initial pattern associated with 
each channel cannot be exact. When training ends, the channel's pattern may still be inexact 
because of possible noise, inconsistency, and uncertainty in the data. However, from the proof 
of the above theorem, we see that when the nonzero weights in the channel's pattern grow
larger, the error derivative ($\partial E / \partial w_{ji}$) generally gets smaller, and so does the weight adjustment;
as a result, the pattern becomes more stable and gradually converges to a target pattern. A
converged pattern does not move unless there is a near-miss instance (with a single feature 
mismatch against the pattern) that causes some error in the model output, in which case, 
the pattern is refined to be a little more general or specific. This analysis explains how the 
CFRule model ensures the stability of a channel once it is settled in a target pattern. Note 
that the output weight of a channel with a stable pattern can still be modified toward global 
error minimization and uncertainty management. In asynchronous parallel convergence, each 
channel is settled in its own target pattern with a different time frame. Without the above 
pattern stabilizing property, global convergence is difficult to achieve in the asynchronous 
mode. This line of argument implies that CFRule admits asynchronous parallel convergence.

Theorem 4.1 is unique to CFRule. That property has not been provable for other types of
neural networks or learning methods (e.g., [16]).

Asynchronous parallel convergence for rule learning can be illustrated by the example in
Section 4.2. Table 3 shows how each channel converges to a target rule in the training course
when the model was trained on just 100 random instances (out of $2^{20}$ possible instances).
For instance, given $\neg x_2$ in the premise of rule 1 (Table 1), we observe the corresponding
weight $w_{1,2}$ of channel 1 converged to -1 (Table 3); also, for $x_6$ mentioned in rule 3, we
see the weight $w_{3,6}$ of channel 3 converged to 1. Only the significant weights that converge
to a magnitude of 1 are shown. Unimportant weights ending up with about zero values are
omitted. The convergence behavior can be better visualized in Figure 1. It clearly shows
that convergence occurs asynchronously for each channel. It does not matter which channel
converges to which rule. This correspondence is determined by the initial weight setting and
the data characteristics. Note that given k channels in the model, there are k! equivalent
permutations in terms of their relative positions in the model. It matters, though, whether
the model as a whole converges to all the needed target rules.

Table 3: Asynchronous parallel convergence to the target rules in the CFRule model. Channels
1, 2, 3 converge to target rules 1, 2, 3, respectively. $w_{j,i}$ denotes the input weight associated
with the input $x_i$ in channel j. An epoch consists of a presentation of all training instances.

epoch   w_{1,1}   w_{1,2}   w_{1,7}   w_{2,1}   w_{2,4}   w_{2,5}   w_{3,6}   w_{3,11}
1       .016      -.073     .005      .144      -.166     .087      .387      .479
5       .202      -.249     .133      .506      -.348     .385      1.00      .948
10      .313      -.420     .256      .868      -.719     .725      1.00      1.00
15      .462      -.529     .440      1.00      -.920     .893      1.00      1.00
20      .851      -.802     .789      1.00      -.983     1.00      1.00      1.00
25      1.00      -.998     .996      1.00      -1.00     1.00      1.00      1.00
30      1.00      -1.00     1.00      1.00      -1.00     1.00      1.00      1.00

Figure 1: The temporal curves of asynchronous parallel convergence for rule learning.
(Horizontal axis: epoch.)

5 Rule Extraction

As illustrated by the example in Section 4, when a channel converges to a target rule, the
weights associated with the input attributes contained in the rule's premise grow into large
values, whereas the rest of input weights decay to small values. The asymptotic absolute
weight values upon convergence approach either 1 or 0 ideally, but this case does not
necessarily happen in practical circumstances involving data noise, inconsistency, uncertainty,
and inadequate sample sizes. However, in whatever circumstances, it turns out that a simple
thresholding mechanism suffices to distinguish important from unimportant weights in the
CFRule model. Since the weight absolute values range from 0 to 1, it is reasonable to use
0.5 as the threshold, but this value does not always guarantee optimal performance. Searching
for a good threshold in a continuous range is difficult. Fortunately, thanks to the
quantizability nature of the system adopting the CF model [9], only a handful of values need
to be considered. Our research has narrowed it down to four candidate values: 0.35, 0.5, 0.65,
and 0.8. A larger threshold makes extracted rules more general, whereas a smaller threshold
makes them more specific. In order to lessen data overfitting, our heuristic is to choose a higher value as
long as the training error is acceptable. Using an independent cross-validation data set is a
good idea if enough data is available. The rule extraction algorithm is formulated below.



The CFRule Rule Extraction Algorithm 

• Select a rule extraction threshold r (0 < r < 1). 

• For each channel j, 

  1. P := nil (an empty set) 

  2. C := the target class 

  3. Normalize the input weights w_ji so that the maximum weight absolute value is 1. 

  4. For each input weight w_ji (1 ≤ i ≤ d, d: the input dimensionality), 

     a. If w_ji > r, then add x_i to P. 
     b. If w_ji < -r, then add ¬x_i to P. 
     c. Else, do nothing. 

  5. Form a rule: "IF P, THEN C" with CF = u_j (u_j: the output weight based on the 
     rule). 

• Remove subsumed rules and rules with low CFs. 
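
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `extract_rules`, `format_rule`, and `min_cf` are hypothetical, weights are passed as plain lists (one list of input weights per channel), and a rule is treated as subsumed when another rule with at least the same CF has a premise that is a subset of its own. The text does not spell out the subsumption test or the CF cutoff, so both are assumptions.

```python
from typing import FrozenSet, List, Sequence, Tuple

Literal = Tuple[int, bool]                    # (input index i, True if negated)
Rule = Tuple[FrozenSet[Literal], str, float]  # (premise P, class C, CF)


def extract_rules(input_weights: Sequence[Sequence[float]],
                  output_weights: Sequence[float],
                  threshold: float = 0.5,
                  min_cf: float = 0.0,
                  target_class: str = "C") -> List[Rule]:
    """Threshold-based extraction: at most one rule per channel."""
    rules: List[Rule] = []
    for w_j, u_j in zip(input_weights, output_weights):
        max_abs = max(abs(w) for w in w_j)
        if max_abs == 0.0:
            continue  # an all-zero channel carries no premise
        premise = set()
        for i, w in enumerate(w_j):
            w_norm = w / max_abs              # step 3: largest |weight| becomes 1
            if w_norm > threshold:
                premise.add((i, False))       # step 4a: add x_i
            elif w_norm < -threshold:
                premise.add((i, True))        # step 4b: add NOT x_i
        if u_j >= min_cf:                     # drop rules with low CFs
            rules.append((frozenset(premise), target_class, u_j))
    # Assumed subsumption test: visit rules from highest to lowest CF and
    # drop any rule whose premise contains the premise of a kept rule.
    rules.sort(key=lambda rule: rule[2], reverse=True)
    kept: List[Rule] = []
    for premise, cls, cf in rules:
        if not any(q <= premise for q, _, _ in kept):
            kept.append((premise, cls, cf))
    return kept


def format_rule(rule: Rule) -> str:
    """Render a rule in IF-THEN form, using 1-based attribute names."""
    premise, cls, cf = rule
    literals = [("NOT " if neg else "") + f"x{i + 1}" for i, neg in sorted(premise)]
    return f"IF {' AND '.join(literals)} THEN {cls} with CF = {cf:.2f}"
```

Each weight is visited once during thresholding, which reflects the linear cost in the number of weights; the pairwise subsumption check is over extracted rules only, of which there is at most one per channel.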



The threshold-based algorithm described here is fundamentally different from the search-based 
algorithms used in neural network rule extraction [7, 9, 20]. The main advantage of the 
threshold-based approach is that its computational complexity is linear in the total number of 
weights, in contrast to the polynomial or even exponential complexity incurred by the 
search-based approach. Furthermore, the former approach obviates the need for the special 
training, pruning, or approximation procedures commonly used in the latter approach for 
complexity reduction. As a result, the threshold-based, direct approach should produce better 
and more reliable rules. Notice that this approach is not applicable to the ordinary 
sigmoid-function neural network, where knowledge is entangled. The admissibility of the 
threshold-based algorithm for rule extraction in CFRule can be ascribed to the CF-combining 
function. 

Table 4: The promoter (of prokaryotes) consensus sequences. 

Region      DNA Sequence Pattern 
Minus-35    @-36=T @-35=T @-34=G @-33=A @-32=C @-31=A 
Minus-10    @-13=T @-12=A @-11=T @-10=A @-9=A @-8=T 

6 Applications 

Two benchmark data sets were selected to demonstrate the value of CFRule in practical 
domains. The promoter data set is characterized by high dimensionality relative to the sample 
size, while the hepatitis data set has many missing values. Thus, both pose challenging 
problems. 

The decision-tree-based rule generator C4.5 [18] was taken as a control because it (together 
with its later version) is currently the most representative (or most often used) rule learning 
system, and because its performance is optimized in a statistical sense. 

6.1 Promoter Recognition in DNA 

In the promoter data set [23], there are 106 instances, each consisting of a DNA nucleotide 
string over four base types: A (adenine), G (guanine), C (cytosine), and T (thymine). Each 
instance string comprises 57 sequential nucleotides, including fifty nucleotides before 
(minus) and six following (plus) the transcription site. An instance is positive if the 
promoter region is present in the sequence; otherwise it is negative. There are 53 positive 
and 53 negative instances. Each position of an instance sequence is encoded by four bits, 
with each bit designating a base type. An instance is thus encoded by a vector of 228 bits 
(57 positions × 4 bits), along with a label indicating whether it is positive or negative. 
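
As a concrete illustration of this encoding, the sketch below maps a 57-nucleotide string to a 228-bit vector. The bit order within each position (A, G, C, T), the function names, and the convention that the string starts at position -50 are assumptions made for the example; the paper does not specify them.

```python
from typing import List

BASES = "AGCT"          # assumed bit order within each 4-bit group
SEQUENCE_LENGTH = 57    # nucleotides per instance, as stated in the text


def encode_instance(sequence: str) -> List[int]:
    """One-hot encode a DNA string: 4 bits per position, 228 bits in total."""
    if len(sequence) != SEQUENCE_LENGTH:
        raise ValueError(f"expected {SEQUENCE_LENGTH} nucleotides, got {len(sequence)}")
    bits: List[int] = []
    for base in sequence.upper():
        if base not in BASES:
            raise ValueError(f"unknown base: {base!r}")
        bits.extend(1 if base == b else 0 for b in BASES)
    return bits


def bit_index(position: int, base: str) -> int:
    """Index of the bit for `base` at an upstream position (-50 .. -1).

    Assumes the first character of the string is position -50.
    """
    if not -50 <= position <= -1:
        raise ValueError("only upstream positions -50..-1 are handled here")
    return (position + 50) * 4 + BASES.index(base)
```

Under these assumptions, a rule condition such as @-36=T refers to the single input bit `bit_index(-36, "T")`.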

In the molecular biology literature, promoter sequences (of prokaryotes) have the average 
constitutions -TTGACA- and -TATAAT-, located at the so-called minus-35 and minus-10 regions, 
respectively [14], as shown in Table 4. 

The CFRule model in this study had 3 channels, which was the minimum number of channels 
needed to bring the training error under 0.02 upon convergence. Still, the model is relatively 
underdetermined because of the low ratio of the number of training instances to the input 
dimension. However, unlike our previous approach [9], we did not use any pruning strategy. 
The model had to learn to cope with high dimensionality by itself. The learning rate was set 
to 0.2 and the rule extraction threshold to 0.5 (both are default values). The model was 
trained on the training data under the MCRO strategy and then tested on the test data. 
Training stopped when the drop in MSE (mean squared error) per epoch fell below a small 
value. Rules were then extracted from the trained model. 

Table 5: The average two-fold cross-validation error rates of the rules learned by C4.5 and 
CFRule, respectively. 

Domain                                 C4.5     CFRule 
Promoters (without prior knowledge)    23.9%    12.8% 
Hepatitis                              7.1%     5.3% 

Cross-validation is an important means of evaluating learning ability. Domain validity is 
indicated if rules learned from some data can be applied well to other data in the same 
domain. In the two-fold cross-validation experiment, the 106 instances were randomly divided 
into two equal subsets. CFRule and C4.5 used the same data partition. The rules learned on 
one subset were tested on the other, and vice versa. The average prediction error rate on the 
test sets was defined as the cross-validation rule error rate. The cross-validation experiment 
with CFRule was run 5 times, each with a different initial weight setting, and the average 
cross-validation error was reported. CFRule had a significantly smaller cross-validation rule 
error rate than C4.5 (12.8% versus 23.9%), as shown in Table 5. Note that the prediction 
accuracy and the error rate were measured based on exact symbolic match. That is, an instance 
is predicted to be in the concept only if it exactly matches some rule of the concept; 
otherwise it is predicted not to be in the concept. If, however, prior domain knowledge is 
used and exact symbolic match is not required, the leave-one-out error rate can be as low as 
2% [7]. 
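
The exact-symbolic-match criterion and the two-fold error rate can be illustrated with a short sketch. The data representation (a rule as a mapping from sequence position to a `(base, negated)` pair, an instance as a mapping from position to base) and all function names are assumptions chosen for the example; `learn` stands for any hypothetical rule learner.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Rule = Dict[int, Tuple[str, bool]]   # position -> (base, True if negated)
Instance = Dict[int, str]            # position -> base
Fold = Tuple[Sequence[Instance], Sequence[bool]]


def matches(rule: Rule, instance: Instance) -> bool:
    """True if the instance satisfies every condition of the rule."""
    return all((instance.get(pos) == base) != negated
               for pos, (base, negated) in rule.items())


def predict(rules: Sequence[Rule], instance: Instance) -> bool:
    """Exact symbolic match: positive only if some rule matches completely."""
    return any(matches(rule, instance) for rule in rules)


def error_rate(rules: Sequence[Rule],
               instances: Sequence[Instance],
               labels: Sequence[bool]) -> float:
    """Fraction of instances whose predicted label differs from the true one."""
    wrong = sum(predict(rules, x) != y for x, y in zip(instances, labels))
    return wrong / len(labels)


def two_fold_cv_error(learn: Callable[[Sequence[Instance], Sequence[bool]], List[Rule]],
                      fold_a: Fold, fold_b: Fold) -> float:
    """Learn on one fold, test on the other, swap, and average the two errors."""
    err_on_b = error_rate(learn(*fold_a), *fold_b)
    err_on_a = error_rate(learn(*fold_b), *fold_a)
    return (err_on_b + err_on_a) / 2
```

For a learner with random initial weights, such as CFRule, this two-fold estimate would be repeated over several runs and averaged, as in the experiment described above.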

Both CFRule and C4.5 learned three rules from the 106 instances. The rules are summarized 
in Table 6. In terms of rule quality, CFRule learned larger rules than C4.5 when samples were 
inadequate. This is because CFRule tends to keep attributes sufficiently correlated with the 
target concept, whereas C4.5 retains only attributes with verified statistical significance 
and tends to favor more general rules. In terms of domain validity, a rule's accuracy based 
on cross-validation is more reliable than other quality measures. Another interesting 
discovery made by CFRule (but not by C4.5) is @-45=A (in rule #3), which plays a major role 
in the so-called conformation theory for promoter prediction [11]. 

The data for this research are available from a machine learning database located at the 
University of California at Irvine, with an ftp address at 
ftp.ics.uci.edu/pub/machine-learning-databases. 

Table 6: The promoter prediction rules learned from 106 instances by CFRule and C4.5, 
respectively. ¬: not. @: at. 

          Rule    DNA Sequence Pattern 
CFRule    #1      @-34=G @-33=¬G @-12=¬G 
          #2      @-36=T @-35=T @-31=¬C @-12=¬G 
          #3      @-45=A @-36=T @-35=T 
C4.5      #1      @-35=T @-34=G 
          #2      @-36=T @-12=A 
          #3      @-36=T @-35=T @-34=T 

Table 7: The hepatitis rules for predicting (bad) prognosis learned from 155 instances by 
CFRule and C4.5, respectively. ¬: not. @: at. 

          Rule    Premise 
CFRule    #1      MALE and NO STEROID and ALBUMIN < 3.7 
C4.5      #1      MALE and NO STEROID and ALBUMIN < 3.7 

6.2 Hepatitis Prognosis Prediction 

In the data set concerning hepatitis prognosis, there are 155 instances, each described by 19 
attributes. Continuous attributes were discretized, the data set was then randomly partitioned 
into two halves (78 and 77 cases), and cross-validation was carried out. CFRule and C4.5 
used exactly the same data to ensure a fair comparison. The CFRule model for this problem 
consisted of 2 channels. Again, CFRule was superior to C4.5 in cross-validation performance 
(see Table 5). However, both systems learned the same single rule from the whole 155 
instances, as displayed in Table 7. That two fundamentally different systems learn the same 
rule is quite a coincidence, but it suggests that the rule is true in a global sense. 

Footnote: This data set is an old version previously used in our research work [7]. 

7 Conclusions 

If global optimization is a main issue for automated rule discovery from data, then current 
machine learning theories do not seem adequate. For instance, the decision-tree and 
neural-network based algorithms, which dodge the complexity of exhaustive search, guarantee 
only local but not global optimization. In this paper, we introduce a new machine learning 
theory based on multi-channel parallel adaptation that shows great promise in learning the 
target rules from data by parallel global convergence. The basic idea is that when a model 
consisting of multiple parallel channels is optimized according to a certain global error 
criterion, each of its channels converges to a target rule. While the theory sounds 
attractive, the main question is how to implement it. In this paper, we show how to realize 
this theory in a learning system named CFRule. 

CFRule is a parallel weight-based model, which can be optimized by weight adaptation. 
The parameter adaptation rule follows the gradient-descent idea, generalized to a multi-level 
parallel context. However, the central idea of the multi-channel rule-learning theory is 
not how the parameters are adapted but rather how each channel can converge to a 
target rule. We have noticed that CFRule exhibits the necessary conditions to ensure such 
convergence behavior. We have further found that CFRule's behavior can be attributed 
to the use of the CF (certainty factor) model for combining the inputs and the channels. 

Since the gradient descent technique seeks only a local minimum, the learning model may 
well settle in a solution where each rule is optimal only in a local sense. A strategy called 
multi-channel regression-based optimization (MCRO) has been developed to address this issue. 
This strategy has proven effective by statistical validation. 

We have formally proven two important properties that account for the parallel rule-learning 
behavior of CFRule. First, we show that any given rule set can be explicitly and 
precisely encoded by the CFRule model. Second, we show that once a channel has settled 
in a target rule, it barely moves. These two conditions encourage the model to move toward 
the target rules. An empirical weight convergence graph clearly showed how each channel 
converged to a target rule in an asynchronous manner. Note, however, that we have not been 
able to prove or demonstrate this rule-oriented convergence behavior in other neural networks. 

We then examined the application of this methodology to DNA promoter recognition 
and hepatitis prognosis prediction. In both domains, CFRule is superior to C4.5 (a 
rule-learning method based on the decision tree) in terms of cross-validation. The rules 
learned are also consistent with knowledge in the literature. 

In conclusion, the multi-channel parallel adaptive rule-learning theory is not only 
theoretically sound and supported by computer simulation but also practically useful. In 
light of its significance, we hope this theory will point to a new direction for machine 
learning and data mining. 